Objective

The objective of this project is to predicts the price and returns of crude oil front month futures contracts. To this end, I select a cost function, build multiple models to identify parameters, use the models to makes predictions in the testing data, and then I use the cost function to rank the models.

Structure of report:

  1. Selection of a cost function

  2. Overview of the data

  3. Preliminary visualizations

  4. Preliminary models

  5. Predictions in the testing data

  6. Ranking models / selection of best model

  7. Final visualizations

Cost Function

Assumptions used to identify a cost function:

Given the above assumptions, the criteria for the cost function is that it should be exponential and symmetric. For this reason, I select the standard average of Squared Errors as the cost function.

The Data

I have collected the data from two sources:

After loading the data, I merged the data and made further adjustments. One adjustement of relevance relates to the treatment of ‘NA’s. While the price data is available on a daily basis, other variables were available on a weekly, monthly, or yearly basis only. In this case, I filled the ’NA’ observations with the average value for the time period. It is worth noting that a better approach would have been to estimate the missing data points using available data and seasonality patterns.

I use data from 2003-12-01 to 2015-11-17 for the analysis, but show some preliminary plots of earlier data from 1983.

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

The Dependent Variable

My dependent variable is daily return on the price of crude oil fronth month active contracts traded in New York Mercantile Exchange (NYMEX). I will also show predictions and visualizations of the price of the contract, though my central aim is to predict the returns.

The data available from EIA website on the price of active crude oil contracts only shows one price per day. Therefore, the returns that I calculate are inter -day returns. It would have been better to obtain opening and closing prices and calculate intra -day returns, but this data was not available on the EIA website. (Note however: the data is available on Columbia’s Bloomber terminals).

Note: Futures contracts prices reflect expectations of the price of the underlying product (in this case, crude oil) at a future point in time (delivery date). Multiple contracts for different future time periods trade at the same time. Typically, the contract that is closer to expiry (called ‘front month’) exhibits the highest trading volume.

Here I calculate prior day price and the interday return:

df$n <- 1:nrow(df)
df$previous_price <- df$Price[ifelse(df$n==1,df$n, df$n-1)]
df$return <- (df$Price - df$previous_price)/ df$previous_price *100
df <- df[ , -17]
df <- df[, c(18,2,1,3:17)]

The Independent Variables

The variables obtained from the above sources are described below. I also contruct additional variables.

Description of Variables

Obtained from the EIA

comm_stocks Weekly U.S. Ending Stocks excluding SPR of Crude Oil (Thousand Barrels). Frequency of data: weekly. Source: EIA.

cost U.S. Nominal Cost per Foot of Crude Oil Wells Drilled (Dollars per Foot). Frequency of data: annual.

reserves U.S. Crude Oil Proved Reserves (Million Barrels). Frequency of data: annual.

spr U.S. Ending Stocks of Crude Oil in Strategic Petroleum Reserves (SPR) (Thousand Barrels). Frequency of data: weekly.

us_production U.S. Field Production of Crude Oil (Thousand Barrels per Day). Frequency of data: monthly.

Obtained from WRDS

consumption Consumption of Oil in the United States (’000 barrers per day), EIA

indstr_prod Industrial Production, Preliminary Series, SA.

cpi Consumer price index - percent change year ago period, NSA

unempl Unemployment rate (%), SA

fed.reserves Total reserves (mil us$) - end of period

ppi Producer Price Index: Depository Credit Intermediation, NSA

indstr.prod2 Industrial Production Index, SA

cad.usd.exch.bid Exchange rate (IDC) - London market close - Forward - BID - 30 day - CDN$/USD

cad.usd.exch.ask Exchange rate (IDC) - London market close - Forward - ASK - 30 day - CDN$/USD

Constructed Variables

I construct the following variables:

  1. Days until contract expiry.

  2. A Fourier series to model seasonality.

Days until contract expiry

I use the below definition of Contract 1 (a.k.a front month) to calculate the expiry date:

Contract Definition: A futures contract specifying the earliest delivery date. For crude oil, each contract expires on the third business day prior to the 25th calendar day of the month preceding the delivery month. If the 25th calendar day of the month is a non-business day, trading ceases on the third business day prior to the business day preceding the 25th calendar day. After a contract expires, Contract 1 for the remainder of that calendar month is the second following month.

Day on the 25th 3 business days before Difference
Sunday Tuesday 5 days
Saturday Tuesday 4 days
Friday Tuesday 3 days
Thursday Monday 3 days
Wednesday Friday 5 days
Tuesday Thursday 5 days
Monday Wednesday 5 days

Identify expiry date

date <- df$date
df2 <- data.frame(date)
df2$month <- as.numeric(format(df2$date, "%m"))
df2$year <- as.numeric(format(df2$date, "%Y"))
df2$day <- as.numeric(format(df2$date, "%d"))
df2$new_date <- as.Date(paste0(df2$year,"-",df2$month,"-25"))
df2$weekday <- as.numeric(format(df2$new_date, "%w"))
df2$days_to_subtr <- ifelse(df2$weekday <= 3, 5, ifelse(df2$weekday == 6, 4, 3)) # Note: Sunday corresponds to weekday 0.
df2$expiry1 <- df2$new_date - df2$days_to_subtr

# Check weekday of expiry day (should only be within 1 and 5)
summary(as.numeric(format(df2$expiry1, "%w")))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.697   4.000   5.000
# Expiry 2
df2$new_date2 <- as.Date(paste0(df2$year,"-",df2$month + 1,"-25"))
df2$weekday2 <- as.numeric(format(df2$new_date2, "%w"))
df2$days_to_subtr2 <- ifelse(df2$weekday2 <= 3, 5, ifelse(df2$weekday2 == 6, 4, 3)) # Note: Sunday corresponds to weekday 0.
df2$expiry2 <- df2$new_date2 - df2$days_to_subtr2
summary(as.numeric(format(df2$expiry2, "%w")))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.000   2.000   2.748   4.000   5.000     665
# Selecting correct expiry day
df2$comparison_day <- as.numeric(format(df2$expiry1, "%d"))
df2$expiration_date <- as.Date(ifelse(df2$day > df2$comparison_day, df2$expiry2, df2$expiry1))

df2$days_until_expiry <- as.numeric(df2$expiration_date-df2$date)
df$expiry_date <- df2$expiration_date
df$days_util_expiry <- df2$days_until_expiry
rm(df2)
rm(date)

df <- df[, c(1:3,19:20,4:18)]

Modelling seasonality

I use a fast Fourier series to model seasonality, as follows.

df$month <- as.numeric(format(df$date, "%m"))
df$sin1 <- sin(2*pi*df$month/12)
df$cos1 <- cos(2*pi*df$month/12) 
df$sin2 <- sin(4*pi*df$month/12)
df$cos2 <- cos(4*pi*df$month/12) 
df$sin3 <- sin(6*pi*df$month/12)
df$cos3 <- cos(6*pi*df$month/12) 
df$sin4 <- sin(8*pi*df$month/12)
df$cos4 <- cos(8*pi*df$month/12) 
df$sin5 <- sin(10*pi*df$month/12)
df$cos5 <- cos(10*pi*df$month/12) 

Preliminary Visualizations

Note: Due to some later conflicts between packages, I visualize a non-linear generalized additive model here (though it is more relevant later).

df2 <- df[complete.cases(df),]
training2 <- df2[df2$date < "2014-01-01",]
testing2 <- df2[df2$date >= "2014-01-01",]

library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.12
gam1 <- gam::gam(return ~ s(date) + s(reserves) + s(comm_stocks) + s(spr) + s(us_production) + s(cost)  + s(indstr_prod) + s(unempl) + s(cad_usd_exch_bid) + s(cpi) + s(reserves_fed) + s(ppi) + s(indstr_prod2) + s(month), data = training2)

par(mfcol = c(2,2))
plot(gam1, residuals = TRUE, pch =".", rugplot = FALSE)
## Warning in gplot.default(x = structure(c(12387, 12388, 12389, 12390,
## 12391, : The "x" component of "s(date)" has class "Date"; no gplot()
## methods available

par(mfcol = c(2,1))

Returns and price overtime

Returns, Industrial production, and cost of drilling

library(ggplot2)
qplot(data = df, x = date, y = return, colour = cost, lwd = indstr_prod, xlab = "Time", ylab = "Crude Oil Active Future Contract ($ / Barrel)") + ggtitle("Returns, Industrial production, and cost of drilling") + geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 9 rows containing missing values (geom_point).

Price, Industrial production, and cost of drilling

qplot(data = df, x = date, y = Price, colour = cost, lwd = indstr_prod, xlab = "Time", ylab = "Crude Oil Active Future Contract ($ / Barrel)") + ggtitle("Price, Industrial production, and cost of drilling") + geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 9 rows containing missing values (geom_point).

Return and selected determinants

chart_return <- function(x, xlab, y = df$return, ylab = "Returns"){
  plot(y = y, x = x, xlab = xlab, ylab = ylab, pch = 1)
  abline(lm(y ~ x), col = "blue", lwd = 1)
}

par(mfcol = c(2,2))

chart_return(df$expiry_date, "Expiry Day of Contract")
chart_return(df$days_util_expiry, "Number of days until expiry")
chart_return(df$reserves, "U.S. Crude Oil Proved Reserves (Million Barrels)")
chart_return(df$comm_stocks, "Ending Stocks excluding SPR of Crude Oil (Thousand Barrels)")

chart_return(df$unempl, "Unemployment rate (%)")
chart_return(df$cad_usd_exch_bid, "CDN$/USD exchange rate - 30 day BID")
chart_return(df$cad_usd_exch_ask, "CDN$/USD exchange rate - 30 day ASK")
chart_return(df$previous_price, "Previous day price")

chart_return(df$month, "Month")
chart_return(df$cos1, "First element of the seasonality pattern (cos1)")

Price and selected determinants

chart <- function(x, xlab, y = df$Price, ylab = "Price"){
  plot(y = y, x = x, xlab = xlab, ylab = ylab, pch = 1)
  abline(lm(y ~ x), col = "blue", lwd = 1)
}

par(mfcol = c(2,2))

chart(df$expiry_date, "Expiry Date of the Contract")
chart(df$reserves, "U.S. Crude Oil Proved Reserves (Million Barrels)")
chart(df$us_production, "U.S. Field Production of Crude Oil (Thousand Barrels per Day)")
chart(df$spr, "U.S. Ending Stocks of Crude Oil in Strategic Petroleum Reserves (SPR) (Thousand Barrels)")

chart(df$comm_stocks, "Ending Stocks excluding SPR of Crude Oil (Thousand Barrels)")
chart(df$indstr_prod, "Industrial Production, Preliminary Series")
chart(df$cpi, "Consumer price index - percent change year ago period")
chart(df$ppi, "Producer Price Index")

Preliminary Models

Training vs. Testing Data

df <- df[complete.cases(df),]
training <- df[df$date < "2014-01-01",]
testing <- df[df$date >= "2014-01-01",]

Linear models

Bivariate linear regression

lm0 <- lm(return ~ date, training)
lm0b <- lm(Price ~ date, training)

library(stargazer)
## 
## Please cite as: 
## 
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
stargazer(lm0, lm0b, type = 'text')
## 
## ============================================================
##                                     Dependent variable:     
##                                 ----------------------------
##                                     return         Price    
##                                      (1)            (2)     
## ------------------------------------------------------------
## date                               -0.00004      0.015***   
##                                   (0.00004)      (0.0003)   
##                                                             
## Constant                            0.674       -138.259*** 
##                                    (0.634)        (4.367)   
##                                                             
## ------------------------------------------------------------
## Observations                        2,466          2,466    
## R2                                  0.0004         0.496    
## Adjusted R2                        -0.00003        0.496    
## Residual Std. Error (df = 2464)     2.340         16.113    
## F Statistic (df = 1; 2464)          0.928      2,428.500*** 
## ============================================================
## Note:                            *p<0.1; **p<0.05; ***p<0.01
p <- qplot(data = df, x = date, y = return, xlab = "Time", ylab = "Returns")
coef_lm0 <- coef(lm0)
p + geom_abline(intercept = coef_lm0[1], slope = coef_lm0[2], col = "green", lwd = 1)

p <- qplot(data = df, x = date, y = Price, xlab = "Time", ylab = "Price ($ / Barrel)")
coef_lm0b <- coef(lm0b)
p + geom_abline(intercept = coef_lm0b[1], slope = coef_lm0b[2], col = "green", lwd = 1)

Multivariate linear regressions

Adding seasonality to the model

lm1a <- lm(return ~ date + sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + sin4 + cos4 + sin5 + cos5, training)
lm1b <- lm(Price ~ date + sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + sin4 + cos4 + sin5 + cos5, training)

lm1a_subset <- step(lm1a, trace = FALSE)
lm1b_subset <- step(lm1b, trace = FALSE)

ggplot(data = training, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = lm1a_subset$fitted.values), color = "green")

ggplot(data = training, aes(x=date))+
geom_line(aes(y = Price)) +
geom_line(aes(y = lm1b_subset$fitted.values), color = "green")

Adding oil fundamentals and economic indicators

lm2a <- lm(return ~ . + comm_stocks * cost  + reserves * us_production - previous_price - Price, training)
lm2b <- lm(Price ~ . + comm_stocks * cost  + reserves * us_production - previous_price - return, training)

# next command is needed for later
X_test <- model.matrix(return ~ . + comm_stocks * cost  + reserves * us_production - previous_price - Price, data = testing)

lm2a_subset <- step(lm2a, trace = FALSE)
lm2b_subset <- step(lm2b, trace = FALSE)

ggplot(data = training, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = lm2a_subset$fitted.values), colour = "green" )

ggplot(data = training, aes(x=date))+
geom_line(aes(y = Price)) +
geom_line(aes(y = lm2b_subset$fitted.values), colour = "green" )

Including prior-day price

lm3a <- lm(return ~ . + comm_stocks * cost  + reserves * us_production - Price, training[,-28])
lm3b <- lm(Price ~ . + comm_stocks * cost  + reserves * us_production - return, training[,-28])

lm3a_subset <- step(lm3a, trace = FALSE)
lm3b_subset <- step(lm3b, trace = FALSE)

ggplot(data = training, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = lm3a_subset$fitted.values), colour = "green" )

ggplot(data = training, aes(x=date))+
geom_line(aes(y = Price)) +
geom_line(aes(y = lm3b_subset$fitted.values), colour = "green" )

Summary of models so far

stargazer(lm1a, lm2a, lm3a_subset, type = 'text')
## 
## ===========================================================================================
##                                                Dependent variable:                         
##                        --------------------------------------------------------------------
##                                                       return                               
##                                 (1)                   (2)                    (3)           
## -------------------------------------------------------------------------------------------
## date                         -0.00004               -0.003                                 
##                              (0.00004)              (0.005)                                
##                                                                                            
## expiry_date                                          0.003                 0.001***        
##                                                     (0.005)                (0.0003)        
##                                                                                            
## days_util_expiry                                                                           
##                                                                                            
##                                                                                            
## reserves                                            0.0002                -0.0001***       
##                                                    (0.0002)               (0.00004)        
##                                                                                            
## comm_stocks                                        -0.00001               -0.00001**       
##                                                    (0.00001)              (0.00000)        
##                                                                                            
## cost                                                -0.005                                 
##                                                     (0.008)                                
##                                                                                            
## us_production                                       0.002*                 0.0004**        
##                                                     (0.001)                (0.0002)        
##                                                                                            
## spr                                                 0.00000                                
##                                                    (0.00001)                               
##                                                                                            
## consumption                                         -0.001                                 
##                                                     (0.001)                                
##                                                                                            
## indstr_prod                                         -1.198                                 
##                                                     (2.365)                                
##                                                                                            
## unempl                                              -0.128                 -0.105**        
##                                                     (0.252)                (0.050)         
##                                                                                            
## cad_usd_exch_bid                                   901.135**             1,047.993***      
##                                                    (421.755)              (332.795)        
##                                                                                            
## cad_usd_exch_ask                                  -901.005**            -1,052.777***      
##                                                    (421.817)              (332.929)        
##                                                                                            
## cpi                                                  0.015                                 
##                                                     (0.077)                                
##                                                                                            
## reserves_fed                                       -0.00000                                
##                                                    (0.00001)                               
##                                                                                            
## ppi                                                  0.010                                 
##                                                     (0.027)                                
##                                                                                            
## indstr_prod2                                        -2.068                                 
##                                                     (6.625)                                
##                                                                                            
## previous_price                                                            -0.031***        
##                                                                            (0.006)         
##                                                                                            
## month                                               -0.047                 -0.032**        
##                                                     (0.098)                (0.015)         
##                                                                                            
## sin1                          0.111*                -0.122                                 
##                               (0.066)               (0.372)                                
##                                                                                            
## cos1                          -0.063                 0.004                -0.279***        
##                               (0.068)               (0.133)                (0.094)         
##                                                                                            
## sin2                           0.069                 0.011                                 
##                               (0.066)               (0.182)                                
##                                                                                            
## cos2                          -0.031                -0.023                                 
##                               (0.067)               (0.117)                                
##                                                                                            
## sin3                          -0.055                -0.072                                 
##                               (0.066)               (0.123)                                
##                                                                                            
## cos3                          -0.057                -0.031                                 
##                               (0.067)               (0.116)                                
##                                                                                            
## sin4                           0.018                 0.010                                 
##                               (0.066)               (0.087)                                
##                                                                                            
## cos4                           0.007                 0.067                                 
##                               (0.067)               (0.121)                                
##                                                                                            
## sin5                          -0.037                -0.052                                 
##                               (0.066)               (0.076)                                
##                                                                                            
## cos5                           0.004                 0.039                                 
##                               (0.068)               (0.120)                                
##                                                                                            
## comm_stocks:cost                                    0.00000                                
##                                                    (0.00000)                               
##                                                                                            
## reserves:us_production                             -0.00000                                
##                                                    (0.00000)                               
##                                                                                            
## Constant                       0.578                 7.765                  3.251          
##                               (0.637)              (20.589)                (2.717)         
##                                                                                            
## -------------------------------------------------------------------------------------------
## Observations                   2,466                 2,466                  2,466          
## R2                             0.003                 0.010                  0.015          
## Adjusted R2                   -0.001                -0.002                  0.011          
## Residual Std. Error      2.342 (df = 2454)     2.342 (df = 2436)      2.327 (df = 2455)    
## F Statistic            0.688 (df = 11; 2454) 0.819 (df = 29; 2436) 3.812*** (df = 10; 2455)
## ===========================================================================================
## Note:                                                           *p<0.1; **p<0.05; ***p<0.01
stargazer(lm1b, lm2b, lm3b_subset, type = 'text')
## 
## ============================================================================================================
##                                                         Dependent variable:                                 
##                        -------------------------------------------------------------------------------------
##                                                                Price                                        
##                                   (1)                         (2)                           (3)             
## ------------------------------------------------------------------------------------------------------------
## date                            0.015***                    -0.029**                                        
##                                 (0.0003)                    (0.013)                                         
##                                                                                                             
## expiry_date                                                 0.051***                     0.001***           
##                                                             (0.013)                      (0.0002)           
##                                                                                                             
## days_util_expiry                                                                                            
##                                                                                                             
##                                                                                                             
## reserves                                                   -0.003***                    -0.0001***          
##                                                             (0.001)                      (0.00003)          
##                                                                                                             
## comm_stocks                                                -0.0004***                   -0.00001***         
##                                                            (0.00004)                     (0.00000)          
##                                                                                                             
## cost                                                         -0.006                                         
##                                                             (0.019)                                         
##                                                                                                             
## us_production                                              -0.007***                     0.0003**           
##                                                             (0.002)                      (0.0001)           
##                                                                                                             
## spr                                                        0.0002***                                        
##                                                            (0.00002)                                        
##                                                                                                             
## consumption                                                 -0.004*                                         
##                                                             (0.002)                                         
##                                                                                                             
## indstr_prod                                                -80.957***                                       
##                                                             (5.655)                                         
##                                                                                                             
## unempl                                                     -4.914***                     -0.095***          
##                                                             (0.601)                       (0.037)           
##                                                                                                             
## cad_usd_exch_bid                                           -1,432.331                   793.812***          
##                                                           (1,008.332)                    (243.457)          
##                                                                                                             
## cad_usd_exch_ask                                           1,307.920                    -798.343***         
##                                                           (1,008.481)                    (243.567)          
##                                                                                                             
## cpi                                                         2.309***                                        
##                                                             (0.183)                                         
##                                                                                                             
## reserves_fed                                               -0.0001***                                       
##                                                            (0.00002)                                        
##                                                                                                             
## ppi                                                        -0.701***                                        
##                                                             (0.065)                                         
##                                                                                                             
## indstr_prod2                                                 23.925                                         
##                                                             (15.839)                                        
##                                                                                                             
## month                                                       0.493**                                         
##                                                             (0.234)                                         
##                                                                                                             
## previous_price                                                                           0.973***           
##                                                                                           (0.005)           
##                                                                                                             
## sin1                            -1.148**                    3.578***                     0.139***           
##                                 (0.448)                     (0.890)                       (0.050)           
##                                                                                                             
## cos1                           -4.151***                   -8.714***                     -0.252***          
##                                 (0.458)                     (0.317)                       (0.069)           
##                                                                                                             
## sin2                             -0.087                    -1.230***                                        
##                                 (0.449)                     (0.435)                                         
##                                                                                                             
## cos2                            -0.924**                     0.073                                          
##                                 (0.456)                     (0.280)                                         
##                                                                                                             
## sin3                             -0.509                      0.084                                          
##                                 (0.449)                     (0.295)                                         
##                                                                                                             
## cos3                             0.339                       -0.069                                         
##                                 (0.456)                     (0.278)                                         
##                                                                                                             
## sin4                             0.481                       0.017                                          
##                                 (0.449)                     (0.208)                                         
##                                                                                                             
## cos4                             -0.460                    -1.394***                                        
##                                 (0.456)                     (0.288)                                         
##                                                                                                             
## sin5                             0.034                      -0.304*                                         
##                                 (0.447)                     (0.183)                                         
##                                                                                                             
## cos5                             -0.431                    -1.459***                                        
##                                 (0.458)                     (0.286)                                         
##                                                                                                             
## comm_stocks:cost                                            -0.00000                                        
##                                                            (0.00000)                                        
##                                                                                                             
## reserves:us_production                                     0.00000***                                       
##                                                            (0.00000)                                        
##                                                                                                             
## Constant                      -137.538***                  204.699***                     3.450*            
##                                 (4.310)                     (49.224)                      (1.979)           
##                                                                                                             
## ------------------------------------------------------------------------------------------------------------
## Observations                     2,466                       2,466                         2,466            
## R2                               0.515                       0.940                         0.994            
## Adjusted R2                      0.513                       0.939                         0.994            
## Residual Std. Error        15.841 (df = 2454)          5.600 (df = 2436)             1.695 (df = 2455)      
## F Statistic            237.058*** (df = 11; 2454) 1,312.432*** (df = 29; 2436) 43,986.520*** (df = 10; 2455)
## ============================================================================================================
## Note:                                                                            *p<0.1; **p<0.05; ***p<0.01

Additional linear models

Lasso

library(lars)
## Loaded lars 1.2
X <- model.matrix(lm2a)

y <- training$return
lasso <- lars(X, y, type = "lasso", trace = FALSE)
summary(lasso)
## LARS/LASSO
## Call: lars(x = X, y = y, type = "lasso", trace = FALSE)
##    Df   Rss      Cp
## 0   1 13497 -5.2705
## 1   2 13482 -6.0312
## 2   3 13482 -4.0474
## 3   4 13481 -2.2668
## 4   5 13472 -1.8637
## 5   6 13470 -0.2218
## 6   7 13467  1.2280
## 7   8 13466  3.0293
## 8   7 13464  0.7490
## 9   8 13460  2.0209
## 10  9 13459  3.7815
## 11 10 13458  5.5076
## 12 11 13457  7.4539
## 13 12 13456  9.2522
## 14 13 13455 11.0860
## 15 14 13450 12.0888
## 16 15 13449 13.9066
## 17 16 13447 15.6459
## 18 17 13447 17.5506
## 19 16 13444 15.1290
## 20 15 13442 12.7693
## 21 16 13438 13.8750
## 22 17 13434 15.3038
## 23 18 13427 15.9567
## 24 19 13427 17.9010
## 25 20 13426 19.7976
## 26 21 13426 21.6871
## 27 22 13421 22.8243
## 28 23 13417 24.0832
## 29 22 13415 21.6957
## 30 23 13411 23.0974
## 31 24 13411 24.9766
## 32 25 13411 26.9721
## 33 24 13409 24.6861
## 34 25 13408 26.4994
## 35 24 13406 24.1946
## 36 23 13406 22.1245
## 37 24 13406 24.0631
## 38 25 13404 25.8297
## 39 26 13404 27.6834
## 40 27 13403 29.5760
## 41 28 13401 31.2114
## 42 27 13400 29.1130
## 43 26 13400 27.0515
## 44 27 13399 28.8030
## 45 28 13398 30.5835
## 46 27 13397 28.4922
## 47 28 13396 30.4007
## 48 27 13396 28.3553
## 49 28 13396 30.2570
## 50 29 13394 31.8967
## 51 28 13393 29.6892
## 52 29 13392 31.5768
## 53 30 13376 30.5763
## 54 29 13375 28.5029
## 55 30 13370 29.6135
## 56 29 13370 27.6073
## 57 30 13369 29.4012
## 58 29 13369 27.3823
## 59 30 13367 29.0000
## 60 31 13367 31.0000
## 61 30 13367 29.0000
## 62 31 13367 31.0000

Principal component regression

library(pls)
## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:stats':
## 
##     loadings
PCR <- pcr(return ~ . - Price - previous_price , data = training, validation = "LOO")
summary(PCR)
## Data:    X dimension: 2466 28 
##  Y dimension: 2466 1
## Fit method: svdpc
## Number of components considered: 28
## 
## VALIDATION: RMSEP
## Cross-validated using 2466 leave-one-out segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            2.34    2.341    2.342    2.343    2.343    2.344    2.344
## adjCV         2.34    2.341    2.342    2.343    2.343    2.344    2.344
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       2.343    2.344    2.345     2.346     2.347     2.348     2.349
## adjCV    2.343    2.344    2.345     2.346     2.347     2.348     2.349
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        2.349      2.35     2.351     2.352     2.353     2.353
## adjCV     2.349      2.35     2.351     2.352     2.353     2.353
##        20 comps  21 comps  22 comps  23 comps  24 comps  25 comps
## CV        2.354     2.355     2.356     2.357     2.359      2.36
## adjCV     2.354     2.355     2.356     2.357     2.359      2.36
##        26 comps  27 comps  28 comps
## CV         2.36     2.359     4.683
## adjCV      2.36     2.359     2.834
## 
## TRAINING: % variance explained
##           1 comps    2 comps   3 comps   4 comps   5 comps   6 comps
## X       74.977266  90.565969  99.88253  99.98656   99.9962   99.9997
## return   0.002893   0.004697   0.04626   0.09734    0.1644    0.2407
##          7 comps  8 comps   9 comps  10 comps  11 comps  12 comps
## X        99.9999  100.000  100.0000  100.0000  100.0000  100.0000
## return    0.3287    0.365    0.3808    0.4049    0.4569    0.4584
##         13 comps  14 comps  15 comps  16 comps  17 comps  18 comps
## X       100.0000   100.000  100.0000  100.0000  100.0000  100.0000
## return    0.4619     0.489    0.5041    0.5042    0.5085    0.5119
##         19 comps  20 comps  21 comps  22 comps  23 comps  24 comps
## X       100.0000  100.0000  100.0000  100.0000  100.0000  100.0000
## return    0.6003    0.6003    0.6056    0.6062    0.6067    0.6116
##         25 comps  26 comps  27 comps  28 comps
## X       100.0000  100.0000  100.0000  100.0000
## return    0.6451    0.6723    0.8749    0.9095

Non-linear models

Generalized additive model

detach("package:mgcv")

library(gam)
gam1 <- gam::gam(return ~ s(date) + s(reserves) + s(comm_stocks) + s(spr) + s(us_production) + s(cost)  + s(indstr_prod) + s(unempl) + s(cad_usd_exch_bid) + s(cpi) + s(reserves_fed) + s(ppi) + s(indstr_prod2) + s(month), data = training)

par(mfcol = c(1,1))
# Due to conflicts between packages, I cannot show the results to the following two commands below. Therefore I show the plots at the beginning.
# plot(gam1, residuals = TRUE, pch =".", rugplot = FALSE)
# summary(gam1)

BART machine

library(bartMachine)
## Loading required package: rJava
## Loading required package: car
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: missForest
## Loading required package: itertools
## Loading required package: iterators
## Welcome to bartMachine v1.2.0! You have 0.53GB memory available.
set_bart_machine_num_cores(parallel::detectCores())
## bartMachine now using 2 cores.
bart <- bartMachine(X = training[,-(1:2)], y = training$return)
## bartMachine initializing with 50 trees...
## Now building bartMachine for regression ...
## evaluating in sample data...done

Boosting

library(gbm)
## Loading required package: survival
## Loading required package: lattice
## Loading required package: parallel
## Loaded gbm 2.1.1
boosted <- gbm(return ~ . + comm_stocks * cost  + reserves * us_production - date -expiry_date, data = training, interaction.depth = 4, shrinkage = 0.001)
## Distribution not specified, assuming gaussian ...

Predictions in the Testing Data

# lm0
testing$return_lm0 <- predict(lm0, newdata = testing)
testing$squarederror_lm0 <- with(testing, (return_lm0 - return)^2)
(average_SE_lm0 <- mean(testing$squarederror_lm0))
## [1] 5.511555
# lm2
testing$return_lm2 <- predict(lm2a, newdata = testing)
## Warning in predict.lm(lm2a, newdata = testing): prediction from a rank-
## deficient fit may be misleading
testing$squarederror_lm2 <- with(testing, (return_lm2 - return)^2)
(average_SE_lm2 <- mean(testing$squarederror_lm2))
## [1] 5.530715
# lm3a_subset
testing$return_lm3_subset <- predict(lm3a_subset, newdata = testing)
testing$squarederror_lm3_subset <- with(testing, (return_lm3_subset - return)^2)
(average_SE_lm3_subset <- mean(testing$squarederror_lm3_subset))
## [1] 6.250821
# lasso  
lasso_return <- predict(lasso, newx = X_test)$fit
average_SE_lasso <- colMeans( (testing$return - lasso_return) ^ 2 )
(average_SE_lasso <- min(average_SE_lasso))
## [1] 5.519184
predictions_lasso <- lasso_return[,which.min(average_SE_lasso)]

# PCR
testing$PCR_return <- predict(PCR, newdata = testing)
average_SE_PCR <- colMeans( (testing$return - testing$PCR_return) ^ 2 )
(average_SE_PCR <- min(average_SE_PCR))
## [1] 5.49972
predictions_PCR <- testing$PCR_return[ , , which.min(average_SE_PCR)]

error_PCR <- predictions_PCR - testing$return

# Bart
testing$bart_return <- predict(bart, new_data = testing[,3:31])
testing$squarederror_bart <- with(testing, (bart_return - return)^2)
(average_SE_bart <- mean(testing$squarederror_bart))
## [1] 9.552035
# boosted
testing$boosted_return <- predict(boosted, newdata = testing, type = "response", n.trees = 100)
testing$squarederror_boost <- with(testing, (boosted_return - return)^2)
(average_SE_boosted <- mean(testing$squarederror_boost))
## [1] 5.520387
# gam
testing$return_gam <- predict(gam1, newdata = testing)
testing$squarederror_gam <- with(testing, (return_gam - return)^2)
(average_SE_gam <- mean(testing$squarederror_gam))
## [1] 6.668476
# compare to a guess of returns = 0 for each day.
testing$squarederror_return0 <- with(testing, (return)^2)
(average_SSE_return0 <- mean(testing$squarederror_return0))
## [1] 5.520292

Ranking models and discussion

According the average SE criteria (selected as the cost function in this project), the best model is the BART machine, which greatly outranks all others.

Note that the BART model predicted the worst. This is because BART assumes independent obeservations and therefore it is not well suited for timeseries data with autocorrelation. Further, the dataset has also been modified so that there isn’t any missing values and therefore, one of the benefits of BART - dealing with missing data - is not taken advantage of here.

The full ranking is:

Rank Number Model Name Average Squared Errors
1 Principal Components Regression 5.49972
2 Bivariate regression of return on time 5.511555
3 Lasso 5.519184
4 Guessing a return of zero 5.520292
5 Boosting 5.520741
6 Multivariate Regression with all inputs except prior price 5.530715
7 Subset of multivariate linear regression with all inputs (using step function) 6.250821
8 General Additive Model 6.668476
9 Bart Machine \(9.552035\)

Final Visualizations

Plots of the best model (PCR)

library(ggplot2)

ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = predictions_PCR), colour = "green" ) + ggtitle("Principal Component Regression in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")

ggplot(data = testing, aes(x = date)) + geom_line(aes(y = error_PCR)) + ggtitle("Plot of errors in Testing Data") + ylab("Error") + xlab("Time")

Plots of predictions in the testing data by other models (for contrast)

ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = bart_return), colour = "green" ) + ggtitle("BART machine in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")

ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = return_lm0), colour = "green" ) + ggtitle("Bivariate Regression in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")

ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = boosted_return), colour = "green" ) + ggtitle("Boosted Model in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")

ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = return_gam), colour = "green" ) + ggtitle("GAM Model in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")

ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = return_lm2), colour = "green" ) + ggtitle("Multivariate Regression 1 in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")

ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = return_lm3_subset), colour = "green" ) + ggtitle("Multivariate Regression 2 in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")

ggplot(data = testing, aes(x=date))+
geom_line(aes(y = return)) +
geom_line(aes(y = predictions_lasso), colour = "green" ) + ggtitle("Lasso in Testing Data") + ylab("Returns: actual (black) vs predicted (green)")